Accessing geospatial data the easy way (Python)

Access to geospatial data has changed significantly over the past decade. Traditionally, data was accessed by downloading files to a local computer, then analyzing them with dedicated software or programming languages. Analysis-ready datasets have always been difficult to obtain because of the diversity of data formats (NetCDF, GRIB2, GeoTIFF, Shapefile, etc.) and the variety of access protocols offered by different providers (OPeNDAP, HTTPS, SFTP, WPS, REST APIs, datamarts, etc.). Beyond that, with the ever-increasing size of geospatial datasets, most modern datasets no longer fit on a local computer, limiting scientific progress.

The datasets presented here are large-scale, analysis-ready, cloud-optimized (ARCO). To implement a single entry point for a list of datasets, we have followed the methodology developed by the Pangeo community, which combines multiple technologies:

- Data lake (S3, Azure Data Lake Storage, GCS, etc.): distributed file-object storage
- Zarr (or alternatively TileDB, COGs): chunked N-dimensional array formats
- Dask (or alternatively Spark, Ray, Distributed): distributed computing and lazy loading
- Intake catalogs (or alternatively STAC): a general interface for loading different data formats, mostly but not limited to spatiotemporal assets

For more information, please refer to Pangeo's website.

It is important to keep in mind that most of the datasets in the catalogue use language-agnostic formats, making them accessible from any programming language (Python, Julia, JavaScript, C, etc.) that implements the relevant specifications (Zarr, NetCDF via kerchunk, GeoJSON, etc.).

[1]:
from distributed import Client
import intake
import hvplot.xarray
import hvplot.pandas
from dask.distributed import PipInstall
import xoak
import xarray as xr
import numpy as np
import pandas as pd

Dask client

We use a Dask client so that all subsequent code compatible with the framework runs in parallel.

[2]:
client = Client()
client
[2]:

Client

Client-76644c23-66a1-11ed-89d0-000d3aa4d991

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Cluster Info

Intake catalogs

Intake is a lightweight package for finding, investigating, loading and disseminating data. A cataloguing system organizes a collection of datasets, and data loaders (drivers) are parameterized so that each dataset is opened in the format the end user expects. In a Python context, multi-dimensional arrays can be opened with xarray's drivers, while vector data (shapefiles, GeoJSON) can be opened with geopandas.
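To illustrate, a catalog entry pairs a driver with its arguments and metadata. The entry below is a hypothetical sketch (the source name, bucket and path are made up, not taken from this catalog):

```yaml
sources:
  example_dataset:            # hypothetical entry name
    description: Example hourly reanalysis store
    driver: zarr              # provided by the intake-xarray plugin
    args:
      urlpath: s3://example-bucket/example.zarr   # placeholder path
      storage_options:
        anon: true
    metadata:
      status: [dev]           # mirrors the dev/prod flag used below
```

Opening `example_dataset` through such a catalog would return a lazy, Dask-backed xarray Dataset via `to_dask()`.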

Here is the URL where you can access the catalog:

[3]:
catalog_url = 'https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml'
cat = intake.open_catalog(catalog_url)
cat
main:
  args:
    path: https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/main.yaml
  description: Master Data Catalog
  driver: intake.catalog.local.YAMLFileCatalog
  metadata: {}

To organize the collection of datasets, the catalogue itself references various sub-catalogs:

[4]:
[cat[field]
 for field in list(cat._entries.keys())]
[4]:
[<Intake catalog: hydrology>,
 <Intake catalog: atmosphere>,
 <Intake catalog: geography>,
 <Intake catalog: climate_change>]

Even though our catalogue is constantly expanding, a number of datasets are already available. The next sections contain several example queries, as well as analyses of various datasets.

The current (flattened) catalogue is described in the table below. Consult the status field before using a dataset. A "dev" flag means we are actively working on the dataset and do not recommend using it. A "prod" flag means it is production-ready: the dataset has undergone quality review and testing. However, users should always double-check on their own, because errors are still possible.
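The status check can be folded into the flattened listing. A minimal sketch, with hard-coded sample rows standing in for the real `describe()` output:

```python
# Each tuple mimics one row of the flattened catalogue:
# (field, dataset_name, status).  Sample values only.
entries = [
    ("atmosphere", "era5_reanalysis_single_levels", "prod"),
    ("atmosphere", "ghcnd_world", "dev"),
    ("hydrology", "melcc", "dev"),
]

def production_ready(entries):
    """Return the names of datasets flagged 'prod'."""
    return [name for _, name, status in entries if status == "prod"]

print(production_ready(entries))  # ['era5_reanalysis_single_levels']
```

The same filter applies to the real catalogue by building `entries` from `cat[field][dataset].describe()['metadata']['status']`.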

[5]:
pd.set_option('display.max_colwidth', None)

pd.DataFrame([[field ,
               dataset,
               cat[field][dataset].describe()['description'],
               cat[field][dataset].describe()['metadata']['status'][0]]
              for field in list(cat._entries.keys())
              for dataset in cat[field]._entries.keys()],
            columns=['field', 'dataset_name', 'description', 'status']) \
.sort_values('field')
[5]:
field dataset_name description status
1 atmosphere era5_reanalysis_single_levels ERA5 hourly estimates of variables on single levels chunked for time series analysis prod
2 atmosphere era5_reanalysis_single_levels_spatial ERA5 hourly estimates of variables on single levels chunked for spatial analysis dev
3 atmosphere era5_land_reanalysis_spatial ERA5-Land hourly estimates on single level chunked for spatial analysis dev
4 atmosphere era5_reanalysis_pressure_levels ERA5 hourly estimates of variables on pressure levels prod
5 atmosphere daymet_daily_na Daymet Data Version 4.0 prod
6 atmosphere ghcnd_world Global Historical Climatology Network daily (GHCNd) dev
7 atmosphere scdna SCDNA a serially complete precipitation and temperature dataset for North America from 1979 to 2018 prod
8 atmosphere 20_century_reanalysis_single_levels NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on single levels spanning 1836 to 2015 chunked for time series analysis prod
9 atmosphere 20_century_reanalysis_single_levels_large_area NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on single levels spanning 1836 to 2015 chunked for spatial analysis prod
10 atmosphere 20_century_reanalysis_pressure_levels NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on pressure levels spanning 1836 to 2015 chunked for time series analysis prod
11 atmosphere 20_century_reanalysis_pressure_levels_large_area NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) on pressure levels spanning 1836 to 2015 chunked for spatial analysis prod
12 atmosphere terraclimate TerraClimate is a dataset of monthly climate and climatic water balance for global terrestrial surfaces from 1958-2019 prod
14 climate_change rcp45_day_NAM_22i_raw_zarr NA-Cordex (limited to rcp45 for now... more to come!) dev
13 geography melcc_polygons MELCC basin delimitation dev
0 hydrology melcc CEHQ daily flow and water levels dev

1) Atmosphere datasets

a) ERA5 single levels

ERA5 is the fifth generation ECMWF atmospheric reanalysis of the global climate covering the period from January 1950 to present. ERA5 is produced by the Copernicus Climate Change Service (C3S) at ECMWF.

Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product.

Temporal extent: 01/01/1979 – 12/31/2020
Spatial extent: World [-180, 180, -90, 90]
Chunks (time-series version): {'time': 14880, 'longitude': 15, 'latitude': 15}
Chunks (spatial version): {'time': 24, 'longitude': 1440, 'latitude': 721}
Spatial resolution: 0.25 degrees
Spatial reference: WGS84 (EPSG:4326)
Temporal resolution: 1 hour
Update frequency: weekly starting in 2023

Data access

[6]:
ds=cat.atmosphere.era5_reanalysis_single_levels.to_dask()
ds
[6]:
<xarray.Dataset>
Dimensions:    (latitude: 721, longitude: 1440, time: 368184)
Coordinates:
  * latitude   (latitude) float32 90.0 89.75 89.5 89.25 ... -89.5 -89.75 -90.0
  * longitude  (longitude) float32 -180.0 -179.8 -179.5 ... 179.2 179.5 179.8
  * time       (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Data variables:
    t2m        (time, latitude, longitude) float32 dask.array<chunksize=(14880, 15, 15), meta=np.ndarray>
    tp         (time, latitude, longitude) float32 dask.array<chunksize=(14880, 15, 15), meta=np.ndarray>
Attributes:
    institution:  ECMWF
    source:       Reanalysis
    title:        ERA5 forecasts

Working with the data

We can quickly select data subsets in both space and time using xarray. Here, we choose July 19–20, 1996, a period when Quebec (Canada) saw historically extreme precipitation. The hvplot graphics package can then be used to track the storm throughout the event.

[7]:
%%time

da = ds.tp \
.sel(time=slice('1996-07-19','1996-07-20'),
     longitude=slice(-90,-50),
     latitude=slice(60,35))

da \
.where(da>=0.001) \
.load() \
.hvplot(groupby='time',
        widget_type='scrubber',
        widget_location='bottom',
        cmap='gist_ncar',
        tiles='ESRI',
        geo=True,
        clim=(0.001, 0.005),
        width=750,
        height=400)
CPU times: user 5.26 s, sys: 371 ms, total: 5.63 s
Wall time: 23.7 s
[7]:

Because this Zarr version of ERA5 is optimized for time-series analysis, the full historical record can be quickly extracted over a relatively small spatial extent (a point or a polygon, for instance). Working with a collection of NetCDF files, by contrast, is typically extremely compute-intensive for such queries, because the files are chunked along the time dimension.
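The impact of chunk layout can be seen with a little arithmetic. Using the chunk sizes from the table above, extracting the full 42-year history at a single grid point touches only a handful of chunks in the time-series layout, but every time chunk in the spatial layout. A back-of-the-envelope sketch (selection assumed aligned to the chunk origin):

```python
import math

def chunks_touched(chunk_sizes, sel_sizes):
    """Number of chunks an origin-aligned selection touches."""
    n = 1
    for dim, sel in sel_sizes.items():
        n *= math.ceil(sel / chunk_sizes[dim])
    return n

# Chunk layouts from the ERA5 table above.
timeseries = {"time": 14880, "latitude": 15, "longitude": 15}
spatial = {"time": 24, "latitude": 721, "longitude": 1440}

# Full history (368184 hourly steps) at one grid point:
point = {"time": 368184, "latitude": 1, "longitude": 1}
print(chunks_touched(timeseries, point))  # 25 chunks
print(chunks_touched(spatial, point))     # 15341 chunks
```

Each touched chunk is a separate object-store read, which is why the time-series layout answers point queries orders of magnitude faster.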

[8]:
%%time
da = (1000*ds.tp) \
.sel(longitude=-75,
     latitude=45,
     method='nearest')

da.hvplot(grid=True, width=800, height=500, color='blue')
CPU times: user 374 ms, sys: 27 ms, total: 401 ms
Wall time: 5.02 s
[8]:
[9]:
%%time
da = (1000*ds.tp) \
.sel(longitude=-75,
     latitude=45,
     method='nearest') \
.resample(time='1Y') \
.sum()

da.hvplot.line(grid=True, width=800, height=500, color='blue')* \
da.hvplot.scatter(marker='o').opts(color='black', size=14)
CPU times: user 1.23 s, sys: 107 ms, total: 1.33 s
Wall time: 10.8 s
[9]:
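The annual-sum pattern above also works on a plain pandas series once the point data is in memory. A synthetic sketch (the constant 0.5 mm/h values are made up, standing in for the extracted ERA5 series; `groupby` on the year is used as a version-robust equivalent of `resample`):

```python
import pandas as pd

# Two years of synthetic hourly "precipitation" at 0.5 mm/h.
index = pd.date_range("2000-01-01", "2001-12-31 23:00", freq="h")
hourly = pd.Series(0.5, index=index)

# Annual totals, mirroring .resample(time='1Y').sum() on the xarray side.
annual = hourly.groupby(hourly.index.year).sum()
print(annual)  # 2000 -> 4392.0 (leap year), 2001 -> 4380.0
```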

b) ERA5 pressure levels

ERA5 is the fifth generation ECMWF atmospheric reanalysis of the global climate covering the period from January 1950 to present. ERA5 is produced by the Copernicus Climate Change Service (C3S) at ECMWF.

(See the description of data assimilation under ERA5 single levels above.)

Temporal extent: 01/01/1979 – 12/31/2019
Spatial extent: Atlantic Northeast [-96, -52, 40, 63]
Chunks: {'time': 8760, 'longitude': 25, 'latitude': 25, 'level': 1}
Spatial resolution: 0.25 degrees
Spatial reference: WGS84 (EPSG:4326)
Temporal resolution: 1 hour
Update frequency: None

[10]:
ds=cat.atmosphere.era5_reanalysis_pressure_levels.to_dask()
ds
[10]:
<xarray.Dataset>
Dimensions:    (latitude: 93, level: 6, longitude: 177, time: 359400)
Coordinates:
  * latitude   (latitude) float32 63.0 62.75 62.5 62.25 ... 40.5 40.25 40.0
  * level      (level) int32 300 400 500 700 850 1000
  * longitude  (longitude) float32 -96.0 -95.75 -95.5 ... -52.5 -52.25 -52.0
  * time       (time) datetime64[ns] 1979-01-01 ... 2019-12-31T23:00:00
Data variables:
    r          (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
    z          (time, level, latitude, longitude) float32 dask.array<chunksize=(8760, 1, 25, 25), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    history:      2019-12-18 03:49:32 GMT by grib_to_netcdf-2.14.0: /opt/ecmw...

Working with the data

[11]:
%%time
ds.z \
.sel(longitude=-75, latitude=45, level=[500, 700, 850, 1000]).hvplot(grid=True, by='level')
CPU times: user 1.7 s, sys: 202 ms, total: 1.9 s
Wall time: 25.3 s
[11]:

c) ERA5-Land

ERA5-Land is the land component of the fifth generation ECMWF atmospheric reanalysis, covering the period from January 1950 to present at enhanced resolution. ERA5-Land is produced by the Copernicus Climate Change Service (C3S) at ECMWF.

(See the description of data assimilation under ERA5 single levels above.)

Temporal extent: 01/01/1950 – present
Spatial extent: North America [-167, -50, 15, 85]
Chunks (time-series version): {'time': 8760, 'longitude': 7, 'latitude': 7}
Chunks (spatial version): {'time': 24, 'longitude': 1171, 'latitude': 701}
Spatial resolution: 0.1 degrees
Spatial reference: WGS84 (EPSG:4326)
Temporal resolution: 1 hour
Update frequency: monthly starting in 2023

It will be available in December 2022. Please refer to the previous ERA5 examples once the dataset is added to the catalog.

d) Daymet

The Daymet dataset contains daily minimum temperature, maximum temperature, precipitation, shortwave radiation, vapor pressure, snow water equivalent, and day length at 1km resolution for North America. Annual and monthly summaries are also available. The dataset covers the period from January 1, 1980 to December 31, 2020.

Daymet is accessible on Azure in Zarr format; this notebook shows how to access the data using the Planetary Computer's resources so that it can be read into an xarray dataset.

Temporal extent: 01/01/1980 – 12/31/2020
Spatial extent: North America
Chunks (time-series version): {'time': 365, 'longitude': 584, 'latitude': 284}
Spatial resolution: 1 km
Spatial reference: Custom ('+ellps=WGS84 +proj=lcc +lon_0=-100 +lat_0=42.5 +x_0=0.0 +y_0=0.0 +lat_1=25 +lat_2=60 +no_defs')
Temporal resolution: 1 day
Update frequency: None

Data access

[12]:
ds=cat.atmosphere.daymet_daily_na.to_dask()
ds
[12]:
<xarray.Dataset>
Dimensions:                  (time: 14965, y: 8075, x: 7814, nv: 2)
Coordinates:
    lat                      (y, x) float32 dask.array<chunksize=(284, 584), meta=np.ndarray>
    lon                      (y, x) float32 dask.array<chunksize=(284, 584), meta=np.ndarray>
  * time                     (time) datetime64[ns] 1980-01-01T12:00:00 ... 20...
  * x                        (x) float32 -4.56e+06 -4.559e+06 ... 3.253e+06
  * y                        (y) float32 4.984e+06 4.983e+06 ... -3.09e+06
Dimensions without coordinates: nv
Data variables:
    dayl                     (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    lambert_conformal_conic  int16 ...
    prcp                     (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    srad                     (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    swe                      (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    time_bnds                (time, nv) datetime64[ns] dask.array<chunksize=(365, 2), meta=np.ndarray>
    tmax                     (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    tmin                     (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    vp                       (time, y, x) float32 dask.array<chunksize=(365, 284, 584), meta=np.ndarray>
    yearday                  (time) int16 dask.array<chunksize=(365,), meta=np.ndarray>
Attributes:
    Conventions:       CF-1.6
    Version_data:      Daymet Data Version 4.0
    Version_software:  Daymet Software Version 4.0
    citation:          Please see http://daymet.ornl.gov/ for current Daymet ...
    references:        Please see http://daymet.ornl.gov/ for current informa...
    source:            Daymet Software Version 4.0
    start_year:        1980

Working with the data

Because Daymet uses a custom projection, we use the xoak library to query the data. It is also possible to regrid or reproject the data to facilitate analysis.
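For context, nearest-neighbour selection over 2-D lat/lon coordinate arrays (which is what xoak indexes with a ball tree) can be sketched with plain numpy. The grid below is synthetic, and the squared-degree distance is a crude stand-in for xoak's great-circle metric:

```python
import numpy as np

# Synthetic 2-D curvilinear coordinates, standing in for Daymet's lat/lon.
lat2d, lon2d = np.meshgrid(np.linspace(40, 50, 5),
                           np.linspace(-80, -70, 5),
                           indexing="ij")

def nearest_ij(lat2d, lon2d, lat, lon):
    """Indices of the grid cell closest (in squared degrees) to a point."""
    dist2 = (lat2d - lat) ** 2 + (lon2d - lon) ** 2
    return np.unravel_index(np.argmin(dist2), dist2.shape)

i, j = nearest_ij(lat2d, lon2d, 45.0, -75.0)
print(i, j)  # 2 2 (the centre of this 5x5 grid)
```

xoak builds this kind of index once per dataset and reuses it, which is why the `set_index` calls in the cell below precede the `.xoak.sel` queries.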

[13]:
%%time
ds = ds.sel(time=slice('2000-01-01','2001-01-01'))
points = xr.Dataset(
    {
        "lat": 45,
        "lon": -75,
    }
)

da_tmax = ds.tmax
da_tmax.xoak.set_index(["lat", "lon"], "sklearn_geo_balltree")

da_tmin = ds.tmin
da_tmin.xoak.set_index(["lat", "lon"], "sklearn_geo_balltree")

prcp = ds.prcp
prcp.xoak.set_index(["lat", "lon"], "sklearn_geo_balltree")

swe = ds.swe
swe.xoak.set_index(["lat", "lon"], "sklearn_geo_balltree")

(da_tmax.xoak.sel(lat=points.lat,
                  lon=points.lon).hvplot(grid=True,
                                         value_label='daily temperature (degrees C)')* \
da_tmin.xoak.sel(lat=points.lat,
                 lon=points.lon).hvplot(grid=True) + \
prcp.xoak.sel(lat=points.lat,
              lon=points.lon).hvplot(grid=True) + \
swe.xoak.sel(lat=points.lat,
             lon=points.lon).hvplot(grid=True)
).cols(1)

CPU times: user 23.3 s, sys: 1.4 s, total: 24.7 s
Wall time: 4min 44s
[13]:

e) SCDNA (extent: North America)

Station-based serially complete datasets (SCDs) of precipitation and temperature observations are important for hydrometeorological studies. Motivated by the lack of serially complete station observations for North America, this study developed an SCD from station data covering 1979 to 2018. The new SCD for North America (SCDNA) includes daily precipitation, minimum temperature (Tmin), and maximum temperature (Tmax) data for 27,276 stations. Raw meteorological station data were obtained from the Global Historical Climatology Network daily (GHCN-D), the Global Surface Summary of the Day (GSOD), Environment and Climate Change Canada (ECCC), and a compiled station database in Mexico. Stations with at least 8-year-long records were selected, underwent location correction, and were subjected to strict quality control. Outputs from three reanalysis products (ERA5, JRA-55, and MERRA-2) provided auxiliary information to estimate station records.

Temporal extent: 01/01/1979 – 12/31/2018
Spatial extent: North America [-177, -52, 7, 83]
Chunks: {'time': 1000, 'ID': 1000}

[14]:
ds=cat.atmosphere.scdna.to_dask()
ds
[14]:
<xarray.Dataset>
Dimensions:    (ID: 27276, time: 14610)
Coordinates:
  * ID         (ID) <U13 'GS91066022701' 'GHMQW00022701' ... 'ECCA008402568'
    elevation  (ID) float32 dask.array<chunksize=(1000,), meta=np.ndarray>
    latitude   (ID) float32 dask.array<chunksize=(1000,), meta=np.ndarray>
    longitude  (ID) float32 dask.array<chunksize=(1000,), meta=np.ndarray>
  * time       (time) datetime64[ns] 1979-01-01 1979-01-02 ... 2018-12-31
Data variables:
    prcp       (ID, time) float32 dask.array<chunksize=(1000, 1000), meta=np.ndarray>
    prcp_flag  (ID, time) float64 dask.array<chunksize=(1000, 1000), meta=np.ndarray>
    prcp_kge   (ID) float32 dask.array<chunksize=(1000,), meta=np.ndarray>
    sflag      (ID) <U3 dask.array<chunksize=(1000,), meta=np.ndarray>
    tmax       (ID, time) float32 dask.array<chunksize=(1000, 1000), meta=np.ndarray>
    tmax_flag  (ID, time) float64 dask.array<chunksize=(1000, 1000), meta=np.ndarray>
    tmax_kge   (ID) float32 dask.array<chunksize=(1000,), meta=np.ndarray>
    tmin       (ID, time) float32 dask.array<chunksize=(1000, 1000), meta=np.ndarray>
    tmin_flag  (ID, time) float64 dask.array<chunksize=(1000, 1000), meta=np.ndarray>
    tmin_kge   (ID) float32 dask.array<chunksize=(1000,), meta=np.ndarray>
[15]:
%%time
ds.prcp \
.sel(time=slice('1996-07-19','1996-07-20')) \
.sum('time') \
.to_dataframe() \
.replace({0:np.nan}) \
.dropna(how='any') \
.hvplot.points(x='longitude',
               y='latitude',
               color='prcp',
               geo=True,
               alpha=0.5,
               xlim=(-180,-30),
               ylim=(0,72),
               tiles='ESRI',
               cmap='gist_ncar',
               clim=(0,100),
               hover_cols=['ID','prcp'],
               width=700,
               height=400,
               title='48h precipitation during the Saguenay flood event')
CPU times: user 675 ms, sys: 53.9 ms, total: 729 ms
Wall time: 8.17 s
[15]:

f) 20th Century Reanalysis - single levels (extent: Atlantic Northeast)

Using a state-of-the-art data assimilation system and surface pressure observations, the NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) project has generated a four-dimensional global atmospheric dataset of weather spanning 1836 to 2015 to place current atmospheric circulation patterns into a historical perspective.

Temporal extent: 01/01/1836 – 12/31/2015
Spatial extent: Atlantic Northeast [-96, -52, 40, 63]
Chunks: {'time': 32872, 'longitude': 6, 'latitude': 3}
Spatial resolution: 1 degree
Spatial reference: WGS84 (EPSG:4326)
Temporal resolution: 3 hours
Update frequency: None

Data access

[16]:
ds=cat.atmosphere['20_century_reanalysis_single_levels'].to_dask()
ds
[16]:
<xarray.Dataset>
Dimensions:    (time: 525952, latitude: 24, longitude: 45)
Coordinates:
  * latitude   (latitude) float32 40.0 41.0 42.0 43.0 ... 60.0 61.0 62.0 63.0
  * longitude  (longitude) float32 -96.0 -95.0 -94.0 -93.0 ... -54.0 -53.0 -52.0
  * time       (time) datetime64[ns] 1836-01-01 ... 2015-12-31T21:00:00
Data variables:
    apcp       (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    cape       (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    crain      (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    pr_wtr     (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    prate      (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    tcdc       (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    tmax       (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
    tmin       (time, latitude, longitude) float32 dask.array<chunksize=(32872, 3, 6), meta=np.ndarray>
Attributes: (12/24)
    Conventions:               CF-1.2
    References:                https://www.psl.noaa.gov/data/gridded/data.20t...
    assimilation_algorithm:    Ensemble Kalman Filter with 4DIAU
    citation:                  Compo,G.P. <https://www.psl.noaa.gov/people/gi...
    citation1:                 Slivinski, L. C, G. P. Compo, J. S. Whitaker, ...
    comments:                  Data are from \nNOAA/CIRES/DOE 20th Century Re...
    ...                        ...
    product:                   reanalysis
    source:                    20CRv3si 2018, Ensemble Kalman Filter, ocean (...
    spatial_resolution:        1.0 degree
    standard_name_vocabulary:  NetCDF Climate and Forecast (CF) Metadata Conv...
    title:                     8x Daily NOAA/CIRES/DOE 20th Century Reanalysi...
    version:                   3si

Working with the data

Here we produce a simple line plot:

[17]:
%%time
ds.sel(latitude=45,
       longitude=-75) \
.prate \
.hvplot(grid=True)
CPU times: user 260 ms, sys: 19.2 ms, total: 279 ms
Wall time: 5.02 s
[17]:

g) 20th Century Reanalysis - single levels (large area: for spatial analysis)

Using a state-of-the-art data assimilation system and surface pressure observations, the NOAA-CIRES-DOE Twentieth Century Reanalysis (20CR) project has generated a four-dimensional global atmospheric dataset of weather spanning 1836 to 2015 to place current atmospheric circulation patterns into a historical perspective.

Temporal extent: 01/01/1836 – 12/31/2015
Spatial extent: Atlantic Northeast [-96, -52, 40, 63]
Chunks: {'time': 100, 'longitude': 45, 'latitude': 24}
Spatial resolution: 1 degree
Spatial reference: WGS84 (EPSG:4326)
Temporal resolution: 3 hours
Update frequency: None

[18]:
ds=cat.atmosphere['20_century_reanalysis_single_levels_large_area'].to_dask()
ds
[18]:
<xarray.Dataset>
Dimensions:    (time: 525952, latitude: 24, longitude: 45)
Coordinates:
  * latitude   (latitude) float32 40.0 41.0 42.0 43.0 ... 60.0 61.0 62.0 63.0
  * longitude  (longitude) float32 -96.0 -95.0 -94.0 -93.0 ... -54.0 -53.0 -52.0
  * time       (time) datetime64[ns] 1836-01-01 ... 2015-12-31T21:00:00
Data variables:
    apcp       (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    cape       (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    crain      (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    pr_wtr     (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    prate      (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    tcdc       (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    tmax       (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
    tmin       (time, latitude, longitude) float32 dask.array<chunksize=(100, 24, 45), meta=np.ndarray>
Attributes: (12/24)
    Conventions:               CF-1.2
    References:                https://www.psl.noaa.gov/data/gridded/data.20t...
    assimilation_algorithm:    Ensemble Kalman Filter with 4DIAU
    citation:                  Compo,G.P. <https://www.psl.noaa.gov/people/gi...
    citation1:                 Slivinski, L. C, G. P. Compo, J. S. Whitaker, ...
    comments:                  Data are from \nNOAA/CIRES/DOE 20th Century Re...
    ...                        ...
    product:                   reanalysis
    source:                    20CRv3si 2018, Ensemble Kalman Filter, ocean (...
    spatial_resolution:        1.0 degree
    standard_name_vocabulary:  NetCDF Climate and Forecast (CF) Metadata Conv...
    title:                     8x Daily NOAA/CIRES/DOE 20th Century Reanalysi...
    version:                   3si

Working with the data

[19]:
%%time
ds.sel(time='2000-01-01T00:00') \
.tmax \
.hvplot(grid=True,
        cmap='cwr',
        geo=True,
        tiles='CartoLight',
        alpha=0.75,
        width=700,
        height=400,)
CPU times: user 118 ms, sys: 0 ns, total: 118 ms
Wall time: 118 ms
[19]:

Other datasets:

The previous examples can be applied to the following datasets as well. We will let the end user experiment with them!

[20]:
ds=cat.atmosphere['20_century_reanalysis_pressure_levels'].to_dask()
ds
[20]:
<xarray.Dataset>
Dimensions:    (time: 525952, level: 17, latitude: 24, longitude: 45)
Coordinates:
  * latitude   (latitude) float32 40.0 41.0 42.0 43.0 ... 60.0 61.0 62.0 63.0
  * level      (level) float64 1.0 5.0 10.0 20.0 ... 700.0 800.0 900.0 1e+03
  * longitude  (longitude) float32 -96.0 -95.0 -94.0 -93.0 ... -54.0 -53.0 -52.0
  * time       (time) datetime64[ns] 1836-01-01 ... 2015-12-31T21:00:00
Data variables:
    air        (time, level, latitude, longitude) float32 dask.array<chunksize=(29200, 1, 24, 25), meta=np.ndarray>
    hgt        (time, level, latitude, longitude) float32 dask.array<chunksize=(29200, 1, 24, 25), meta=np.ndarray>
    omega      (time, level, latitude, longitude) float32 dask.array<chunksize=(29200, 1, 24, 25), meta=np.ndarray>
    rhum       (time, level, latitude, longitude) float32 dask.array<chunksize=(29200, 1, 24, 25), meta=np.ndarray>
Attributes: (12/25)
    Conventions:                     CF-1.2
    DODS_EXTRA.Unlimited_Dimension:  time
    References:                      https://www.esrl.noaa.gov/psd/data/gridd...
    assimilation_algorithm:          Ensemble Kalman Filter with 4DIAU
    citation:                        Compo,G.P. <https://www.esrl.noaa.gov/ps...
    citation1:                       Slivinski, L. C, G. P. Compo, J. S. Whit...
    ...                              ...
    product:                         reanalysis
    source:                          20CRv3si 2018, Ensemble Kalman Filter, o...
    spatial_resolution:              1.0 degree
    standard_name_vocabulary:        NetCDF Climate and Forecast (CF) Metadat...
    title:                           8x Daily NOAA/CIRES/DOE 20th Century Rea...
    version:                         3si
[21]:
ds=cat.atmosphere['20_century_reanalysis_pressure_levels_large_area'].to_dask()
ds
[21]:
<xarray.Dataset>
Dimensions:    (time: 525952, level: 17, latitude: 24, longitude: 45)
Coordinates:
  * latitude   (latitude) float32 40.0 41.0 42.0 43.0 ... 60.0 61.0 62.0 63.0
  * level      (level) float64 1.0 5.0 10.0 20.0 ... 700.0 800.0 900.0 1e+03
  * longitude  (longitude) float32 -96.0 -95.0 -94.0 -93.0 ... -54.0 -53.0 -52.0
  * time       (time) datetime64[ns] 1836-01-01 ... 2015-12-31T21:00:00
Data variables:
    air        (time, level, latitude, longitude) float32 dask.array<chunksize=(100, 1, 24, 45), meta=np.ndarray>
    hgt        (time, level, latitude, longitude) float32 dask.array<chunksize=(100, 1, 24, 45), meta=np.ndarray>
    omega      (time, level, latitude, longitude) float32 dask.array<chunksize=(100, 1, 24, 45), meta=np.ndarray>
    rhum       (time, level, latitude, longitude) float32 dask.array<chunksize=(100, 1, 24, 45), meta=np.ndarray>
Attributes: (12/25)
    Conventions:                     CF-1.2
    DODS_EXTRA.Unlimited_Dimension:  time
    References:                      https://www.esrl.noaa.gov/psd/data/gridd...
    assimilation_algorithm:          Ensemble Kalman Filter with 4DIAU
    citation:                        Compo,G.P. <https://www.esrl.noaa.gov/ps...
    citation1:                       Slivinski, L. C, G. P. Compo, J. S. Whit...
    ...                              ...
    product:                         reanalysis
    source:                          20CRv3si 2018, Ensemble Kalman Filter, o...
    spatial_resolution:              1.0 degree
    standard_name_vocabulary:        NetCDF Climate and Forecast (CF) Metadat...
    title:                           8x Daily NOAA/CIRES/DOE 20th Century Rea...
    version:                         3si
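Because the variables above are Dask-backed, selections and reductions stay lazy until the result is actually needed. The sketch below uses a tiny synthetic dataset with the same dimension layout (the real catalog entry requires network access); the variable name `air` and the coordinate names mirror the repr above, but the values are made up for illustration:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the same (time, level, latitude, longitude) layout
# as the reanalysis dataset above; values are arbitrary.
ds = xr.Dataset(
    {"air": (("time", "level", "latitude", "longitude"),
             np.arange(2 * 3 * 4 * 5, dtype="float32").reshape(2, 3, 4, 5))},
    coords={
        "time": np.array(["1836-01-01", "1836-01-01T03:00"], dtype="datetime64[ns]"),
        "level": [1.0, 5.0, 10.0],
        "latitude": np.linspace(40.0, 43.0, 4),
        "longitude": np.linspace(-96.0, -92.0, 5),
    },
)

# Select one pressure level and average over the spatial dimensions.
# On the catalog-loaded dataset this expression stays lazy (Dask) until computed.
mean_air = ds["air"].sel(level=5.0).mean(dim=("latitude", "longitude"))
print(mean_air.values)
```

On the real dataset, only the chunks touched by the selection are fetched from object storage, which is what makes subsetting a 500k-timestep archive practical.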
[22]:
ds = cat.atmosphere['terraclimate'].to_dask()
ds
[22]:
<xarray.Dataset>
Dimensions:                 (time: 744, lat: 4320, lon: 8640, crs: 1)
Coordinates:
  * crs                     (crs) int16 3
  * lat                     (lat) float64 89.98 89.94 89.9 ... -89.94 -89.98
  * lon                     (lon) float64 -180.0 -179.9 -179.9 ... 179.9 180.0
  * time                    (time) datetime64[ns] 1958-01-01 ... 2019-12-01
Data variables: (12/18)
    aet                     (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    def                     (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    pdsi                    (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    pet                     (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    ppt                     (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    ppt_station_influence   (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    ...                      ...
    tmin                    (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    tmin_station_influence  (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    vap                     (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    vap_station_influence   (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    vpd                     (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
    ws                      (time, lat, lon) float32 dask.array<chunksize=(12, 1440, 1440), meta=np.ndarray>
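A common next step with a monthly dataset like TerraClimate is a monthly climatology via `groupby('time.month')`. A minimal sketch on a tiny synthetic array standing in for the `ppt` variable shown above (the real array is far larger and stays chunked through the whole computation):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of monthly values on a 2x2 grid, standing in for TerraClimate's `ppt`.
# Each month is given its month number as a value, so the climatology is easy to check.
times = pd.date_range("1958-01-01", periods=24, freq="MS")
ppt = xr.DataArray(
    np.tile(np.arange(1, 13, dtype="float32"), 2).reshape(24, 1, 1)
    * np.ones((24, 2, 2), dtype="float32"),
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": [10.0, 10.1], "lon": [20.0, 20.1]},
    name="ppt",
)

# Monthly climatology: average all Januaries together, all Februaries, etc.
clim = ppt.groupby("time.month").mean("time")
print(clim.sel(month=1).values)  # every January had value 1.0
```

On the catalog dataset the same two lines work unchanged; Dask schedules the reduction chunk by chunk.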
[23]:
# A new dataset covering all RCPs is being prepared and will replace this one
ds = cat.climate_change['rcp45_day_NAM_22i_raw_zarr'].to_dask()
ds

[23]:
<xarray.Dataset>
Dimensions:    (lat: 258, lon: 600, member_id: 3, time: 34698, bnds: 2)
Coordinates:
  * lat        (lat) float64 12.12 12.38 12.62 12.88 ... 75.62 75.88 76.12 76.38
  * lon        (lon) float64 -171.9 -171.6 -171.4 ... -22.62 -22.38 -22.12
  * member_id  (member_id) <U20 'CanESM2.CRCM5-OUR' ... 'GFDL-ESM2M.CRCM5-OUR'
  * time       (time) datetime64[ns] 2006-01-01T12:00:00 ... 2100-12-31T12:00:00
    time_bnds  (time, bnds) datetime64[ns] dask.array<chunksize=(17349, 2), meta=np.ndarray>
Dimensions without coordinates: bnds
Data variables: (12/15)
    hurs       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    huss       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    pr         (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    prec       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    ps         (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    rsds       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    ...         ...
    tasmin     (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    temp       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    tmax       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    tmin       (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    uas        (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
    vas        (member_id, time, lat, lon) float32 dask.array<chunksize=(3, 1000, 65, 120), meta=np.ndarray>
Attributes: (12/23)
    CORDEX_domain:                  NAM-22
    contact:                        {"GFDL-ESM2M.CRCM5-OUR": "biner.sebastien...
    creation_date:                  {"GFDL-ESM2M.CRCM5-OUR": "2019-02-12 15:2...
    driving_experiment:             {"GFDL-ESM2M.CRCM5-OUR": "GFDL-ESM2M,rcp4...
    driving_experiment_name:        rcp45
    driving_model_ensemble_member:  {"GFDL-ESM2M.CRCM5-OUR": "r1i1p1", "CanES...
    ...                             ...
    references:                     {"GFDL-ESM2M.CRCM5-OUR": "http://www.oura...
    title:                          {"GFDL-ESM2M.CRCM5-OUR": "NA-CORDEX Raw N...
    tracking_id:                    {"GFDL-ESM2M.CRCM5-OUR": "5139ec82-c55f-4...
    version:                        {"GFDL-ESM2M.CRCM5-OUR": "1.1", "CanESM2....
    zarr-dataset-reference:         For dataset documentation, see DOI https:...
    zarr-version:                   1.0
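The `member_id` coordinate above stacks several driving-model simulations in one dataset, so a single `sel` picks one simulation and a reduction over `member_id` gives the ensemble statistic. A hedged sketch on a small synthetic array (the member names mirror the repr; the values are invented):

```python
import numpy as np
import xarray as xr

# Stand-in with the same member_id coordinate style as the NA-CORDEX dataset.
members = ["CanESM2.CRCM5-OUR", "GFDL-ESM2M.CRCM5-OUR"]
tas = xr.DataArray(
    np.array([[280.0, 281.0], [285.0, 286.0]], dtype="float32"),
    dims=("member_id", "time"),
    coords={"member_id": members,
            "time": np.array(["2006-01-01", "2006-01-02"], dtype="datetime64[ns]")},
    name="tas",
)

# One driving-model simulation...
one = tas.sel(member_id="CanESM2.CRCM5-OUR")
# ...and the ensemble mean across all members.
ens_mean = tas.mean("member_id")
print(one.values, ens_mean.values)
```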
[24]:
# Sample of MELCC hydrometric data; records from other providers still need to be added
ds = cat.hydrology['melcc'].to_dask()
ds

[24]:
<xarray.Dataset>
Dimensions:                 (basin_id: 470, time: 41007)
Coordinates: (12/16)
    _last_update_timestamp  (basin_id) datetime64[ns] dask.array<chunksize=(470,), meta=np.ndarray>
    aggregation             (basin_id) <U1 dask.array<chunksize=(470,), meta=np.ndarray>
  * basin_id                (basin_id) <U6 '010101' '010801' ... '135201'
    data_type               (basin_id) <U1 dask.array<chunksize=(470,), meta=np.ndarray>
    drainage_area           (basin_id) float32 dask.array<chunksize=(470,), meta=np.ndarray>
    end_date                (basin_id) datetime64[ns] dask.array<chunksize=(470,), meta=np.ndarray>
    ...                      ...
    regulated               (basin_id) <U1 dask.array<chunksize=(470,), meta=np.ndarray>
    source                  (basin_id) <U1 dask.array<chunksize=(470,), meta=np.ndarray>
    start_date              (basin_id) datetime64[ns] dask.array<chunksize=(470,), meta=np.ndarray>
  * time                    (time) datetime64[ns] 1910-01-01 ... 2022-04-09
    timestep                (basin_id) <U1 dask.array<chunksize=(470,), meta=np.ndarray>
    units                   (basin_id) <U1 dask.array<chunksize=(470,), meta=np.ndarray>
Data variables:
    flag                    (time, basin_id) <U1 dask.array<chunksize=(2563, 59), meta=np.ndarray>
    value                   (time, basin_id) float32 dask.array<chunksize=(5126, 59), meta=np.ndarray>
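With station data laid out as `value(time, basin_id)`, a single basin's record is one `sel` away. A minimal sketch on a synthetic stand-in (basin IDs copied from the repr above, values invented); `dropna` is useful here because real station records rarely span the full 1910–2022 time axis:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in mimicking the MELCC layout: value(time, basin_id).
times = pd.date_range("2000-01-01", periods=4, freq="D")
value = xr.DataArray(
    np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]], dtype="float32"),
    dims=("time", "basin_id"),
    coords={"time": times, "basin_id": ["010101", "010801"]},
    name="value",
)

# One station's record, dropping missing timesteps before analysis.
series = value.sel(basin_id="010101").dropna("time")
print(float(series.mean()))  # 2.5
```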